The following libraries are used for this notebook:
# Load libraries
library(tidyverse)
library(tidymodels)
library(glmnet)
library(leaps)
library(naniar)
library(skimr)
library(knitr)
library(corrplot)
library(ranger)
library(doParallel)
library(themis)
library(vip)
# Read csv with listing information
data <- read_csv(gzfile("listings.csv.gz"))
38 parsing failures.
row col expected actual file
15026 license 1/0/T/F/TRUE/FALSE NL857416819B01 <connection>
15673 license 1/0/T/F/TRUE/FALSE NL825517485B01 <connection>
15961 license 1/0/T/F/TRUE/FALSE NL825517485B01 <connection>
16391 license 1/0/T/F/TRUE/FALSE 855596338B01 <connection>
16823 license 1/0/T/F/TRUE/FALSE NL854649426B01 <connection>
..... ....... .................. .............. ............
See problems(...) for more details.
The goal of the assignment is to build a model that predicts the prices of AirBnB listings in Amsterdam. The model's output will be used to suggest to new hosts the average platform price for similar listings; hosts can then choose whether to follow the recommendation and set their prices accordingly, in order to be competitive and attract guests from the start. All variables containing information on reviews describe a listing only after it has been published, so these variables are not included in the data set. Moreover, the variables containing a description and summary of the listing could be analyzed using NLP (e.g. sentiment analysis), but this is beyond the scope of the assignment, so these variables are excluded as well. The variables below are included in the data set for further analysis and cleaning.
# Generate subset with variables of interest
data_sub <- data %>%
select(id, price, property_type, room_type, accommodates, bathrooms, bedrooms,
beds, bed_type, amenities, host_since, host_response_time,
host_response_rate, host_neighbourhood, host_listings_count,
host_verifications, host_identity_verified, neighbourhood_cleansed,
square_feet, cleaning_fee, guests_included, extra_people,
minimum_nights, maximum_nights, availability_30, availability_60,
availability_90, availability_365, instant_bookable,
cancellation_policy, require_guest_profile_picture,
require_guest_phone_verification, calculated_host_listings_count,
calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms)
# Inspect data
head(data_sub)
# Inspect data
skim(data_sub) %>% knit_print()
| Name | data_sub |
| Number of rows | 20025 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 1 |
| logical | 4 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| price | 0 | 1.00 | 5 | 9 | 0 | 479 | 0 |
| property_type | 0 | 1.00 | 3 | 22 | 0 | 34 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1303 | 0 | 19213 | 0 |
| host_response_time | 158 | 0.99 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 158 | 0.99 | 2 | 4 | 0 | 61 | 0 |
| host_neighbourhood | 5972 | 0.70 | 4 | 35 | 0 | 81 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 381 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 38 | 0 | 22 | 0 |
| cleaning_fee | 3604 | 0.82 | 5 | 7 | 0 | 112 | 0 |
| extra_people | 0 | 1.00 | 5 | 7 | 0 | 112 | 0 |
| cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 158 | 0.99 | 2008-09-24 | 2019-12-06 | 2015-02-08 | 3133 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_identity_verified | 158 | 0.99 | 0.39 | FAL: 12094, TRU: 7773 |
| instant_bookable | 0 | 1.00 | 0.26 | FAL: 14839, TRU: 5186 |
| require_guest_profile_picture | 0 | 1.00 | 0.01 | FAL: 19819, TRU: 206 |
| require_guest_phone_verification | 0 | 1.00 | 0.01 | FAL: 19758, TRU: 267 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 19117026.00 | 11473019.80 | 2818 | 9631819 | 18418621 | 27913527.0 | 40655209 | ▇▇▇▅▆ |
| accommodates | 0 | 1.00 | 2.87 | 1.30 | 1 | 2 | 2 | 4.0 | 18 | ▇▁▁▁▁ |
| bathrooms | 6 | 1.00 | 1.18 | 0.39 | 0 | 1 | 1 | 1.5 | 8 | ▇▁▁▁▁ |
| bedrooms | 13 | 1.00 | 1.45 | 0.89 | 0 | 1 | 1 | 2.0 | 12 | ▇▁▁▁▁ |
| beds | 31 | 1.00 | 1.79 | 1.41 | 0 | 1 | 1 | 2.0 | 32 | ▇▁▁▁▁ |
| host_listings_count | 158 | 0.99 | 3.85 | 28.93 | 0 | 1 | 1 | 1.0 | 751 | ▇▁▁▁▁ |
| square_feet | 19662 | 0.02 | 542.88 | 560.24 | 0 | 0 | 484 | 834.0 | 3229 | ▇▅▁▁▁ |
| guests_included | 0 | 1.00 | 1.46 | 0.95 | 1 | 1 | 1 | 2.0 | 16 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.43 | 14.74 | 1 | 2 | 2 | 3.0 | 1001 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 613.89 | 548.62 | 1 | 20 | 1125 | 1125.0 | 11250 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 4.46 | 7.81 | 0 | 0 | 0 | 6.0 | 30 | ▇▁▁▁▁ |
| availability_60 | 0 | 1.00 | 10.05 | 17.14 | 0 | 0 | 0 | 13.0 | 60 | ▇▁▁▁▁ |
| availability_90 | 0 | 1.00 | 15.69 | 26.77 | 0 | 0 | 0 | 19.0 | 90 | ▇▁▁▁▁ |
| availability_365 | 0 | 1.00 | 47.93 | 95.34 | 0 | 0 | 0 | 37.0 | 365 | ▇▁▁▁▁ |
| calculated_host_listings_count | 0 | 1.00 | 1.97 | 5.20 | 1 | 1 | 1 | 1.0 | 72 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.50 | 5.06 | 0 | 1 | 1 | 1.0 | 72 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 0.38 | 1.08 | 0 | 0 | 0 | 0.0 | 16 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 0.01 | 0.09 | 0 | 0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
First, we converted all categorical and logical variables that did not need any further cleaning to factors.
# Convert categorical vars to factors
data_sub$property_type <- factor(data_sub$property_type,
levels = unique(data_sub$property_type))
data_sub$room_type <- factor(data_sub$room_type,
levels = unique(data_sub$room_type))
data_sub$bed_type <- factor(data_sub$bed_type,
levels = unique(data_sub$bed_type))
data_sub$host_response_time <- factor(data_sub$host_response_time,
levels = unique(data_sub$host_response_time))
data_sub$host_neighbourhood <- factor(data_sub$host_neighbourhood,
levels = unique(data_sub$host_neighbourhood))
data_sub$neighbourhood_cleansed <- factor(data_sub$neighbourhood_cleansed,
levels = unique(data_sub$neighbourhood_cleansed))
data_sub$cancellation_policy <- factor(data_sub$cancellation_policy,
levels = unique(data_sub$cancellation_policy))
# Convert logical variables to factors
data_sub$host_identity_verified <- factor(data_sub$host_identity_verified)
data_sub$instant_bookable <- factor(data_sub$instant_bookable)
data_sub$require_guest_profile_picture <-
factor(data_sub$require_guest_profile_picture)
data_sub$require_guest_phone_verification <-
factor(data_sub$require_guest_phone_verification)
Second, some variables are numeric in nature but are stored as strings containing a dollar or percentage sign. We removed the signs from the strings and converted the remaining numbers to numeric variables.
# Remove $ sign from columns containing prices and convert to doubles
data_sub$price <- as.double(gsub("[,$]", "", data_sub$price))
data_sub$cleaning_fee <- as.double(gsub("[,$]", "", data_sub$cleaning_fee))
data_sub$extra_people <- as.double(gsub("[,$]", "", data_sub$extra_people))
# Replace "N/A" values, remove % and convert to percentage
data_sub$host_response_rate <- na_if(data_sub$host_response_rate, "N/A")
data_sub$host_response_rate <-
as.double(gsub("[%]", "", data_sub$host_response_rate)) / 100
The variable amenities contains all the amenities of a listing. However, these were stored in one large string and the elements could not be accessed separately. Therefore, we cleaned the string and split it into a list containing the separate elements. Furthermore, we extracted all unique amenities and stored them in a vector.
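The cleaning code itself is not shown in the notebook; a minimal sketch of the idea, using a toy input (the real code would apply the same steps to data_sub$amenities to build data_sub$amenities_clean and amenities_unique):

```r
library(tidyverse)

# Toy version of the cleaning step: strip braces and quotes from the raw
# amenities string, lowercase it, and split on commas into a list
amenities_raw <- c('{"Wifi","Cable TV"}', '{Heating,"Smoke detector"}')
amenities_clean <- amenities_raw %>%
  str_remove_all('[{}"]') %>%
  str_to_lower() %>%
  str_split(",")
amenities_unique <- unique(unlist(amenities_clean))
amenities_unique
# → "wifi" "cable tv" "heating" "smoke detector"
```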
amenities_unique
[1] "internet" "wifi" "paid parking off premises" "buzzer/wireless intercom"
[5] "heating" "washer" "smoke detector" "carbon monoxide detector"
[9] "first aid kit" "safety card" "fire extinguisher" "essentials"
[13] "shampoo" "lock on bedroom door" "24-hour check-in" "hangers"
[17] "hair dryer" "iron" "laptop friendly workspace" "translation missing: en.hosting_amenity_49"
[21] "translation missing: en.hosting_amenity_50" "private entrance" "hot water" "bed linens"
[25] "extra pillows and blankets" "single level home" "garden or backyard" "no stairs or steps to enter"
[29] "accessible-height bed" "host greets you" "handheld shower head" "paid parking on premises"
[33] "tv" "refrigerator" "long term stays allowed" "cable tv"
[37] "kitchen" "elevator" "indoor fireplace" "family/kid friendly"
[41] "dryer" "private living room" "well-lit path to entrance" "breakfast"
[45] "self check-in" "smart lock" "lake access" "pets live on this property"
[49] "cat(s)" "smoking allowed" "pets allowed" "microwave"
[53] "coffee maker" "dishwasher" "dishes and silverware" "cooking basics"
[57] "oven" "stove" "patio or balcony" "keypad"
[61] "luggage dropoff allowed" "baby bath" "bathtub" "babysitter recommendations"
[65] "beach essentials" "cleaning before checkout" "ev charger" "pack ’n play/travel crib"
[69] "high chair" "crib" "children’s books and toys" "room-darkening shades"
[73] "children’s dinnerware" "free street parking" "other" "extra space around bed"
[77] "wheelchair accessible" "pocket wifi" "wide hallways" "waterfront"
[81] "washer / dryer" "window guards" "air conditioning" "suitable for events"
[85] "free parking on premises" "lockbox" "wide entrance for guests" "bbq grill"
[89] "ground floor access" "dog(s)" "hot tub" "wide entrance"
[93] "accessible-height toilet" "wide entryway" "hot water kettle" "ethernet connection"
[97] "flat path to guest entrance" "wide doorway to guest bathroom" "wide clearance to shower" "toilet"
[101] "step-free shower" "gym" "outlet covers" "firm mattress"
[105] "changing table" "stair gates" "fireplace guards" "table corner guards"
[109] "game console" "baby monitor" "doorman" "building staff"
[113] "pool" "other pet(s)" "fixed grab bars for shower" "disabled parking spot"
[117] "electric profiling bed" "bathtub with bath chair" "shower gel" "fixed grab bars for toilet"
[121] "beachfront" "shower chair" "trash can" "ski-in/ski-out"
[125] "mobile hoist" "pool with pool hoist" "ceiling hoist" "full kitchen"
[129] "private bathroom" "air purifier" "bread maker" "roll-in shower with chair"
With the vector of all unique amenities, separate variables could be created for each amenity. However, since there are about 130 usable amenities, this seemed beyond the scope of the assignment. Moreover, some amenities contain information that is already given by other variables or amenities. For example, a private bathroom could be a strong indicator that the listing is an entire house/apartment, and shampoo or shower gel could be strong indicators for the presence of a bathroom. Furthermore, some amenities are very specific and apply to only one or a few listings. Nevertheless, variables are created for some amenities that could influence the price of a listing: wifi, pool, hot_tub and tv.
# Create variable for WIFI and add to data set
wifi <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
if("wifi" %in% data_sub$amenities_clean[[i]] |
"internet" %in% data_sub$amenities_clean[[i]]) {
wifi[i] <- "yes"
} else {
wifi[i] <- "no"
}
}
data_sub$wifi <- wifi
data_sub$wifi <- factor(data_sub$wifi, levels = c("yes", "no"))
# Create variable for pool and add to data set
pool <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
if("pool" %in% data_sub$amenities_clean[[i]] |
"pool with pool hoist" %in% data_sub$amenities_clean[[i]]) {
pool[i] <- "yes"
} else {
pool[i] <- "no"
}
}
data_sub$pool <- pool
data_sub$pool <- factor(data_sub$pool, levels = c("yes", "no"))
# Create variable for hot_tub and add to data set
hot_tub <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
if("hot tub" %in% data_sub$amenities_clean[[i]]) {
hot_tub[i] <- "yes"
} else {
hot_tub[i] <- "no"
}
}
data_sub$hot_tub <- hot_tub
data_sub$hot_tub <- factor(data_sub$hot_tub, levels = c("yes", "no"))
# Create variable for tv and add to data set
tv <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
if("tv" %in% data_sub$amenities_clean[[i]] |
"cable tv" %in% data_sub$amenities_clean[[i]]) {
tv[i] <- "yes"
} else {
tv[i] <- "no"
}
}
data_sub$tv <- tv
data_sub$tv <- factor(data_sub$tv, levels = c("yes", "no"))
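The four loops above share one pattern; a hypothetical helper (has_amenity, not part of the notebook's actual code) could collapse each of them into a single call:

```r
# Hypothetical helper: returns a yes/no factor indicating whether any of the
# given target amenities is present in each listing's amenity list
has_amenity <- function(amenity_list, targets) {
  present <- vapply(amenity_list,
                    function(x) any(targets %in% x),
                    logical(1))
  factor(ifelse(present, "yes", "no"), levels = c("yes", "no"))
}

# Toy check with two listings
toy <- list(c("wifi", "tv"), c("heating"))
has_amenity(toy, c("wifi", "internet"))
# → yes no  (Levels: yes no)
# On the real data this would be e.g.:
# data_sub$wifi <- has_amenity(data_sub$amenities_clean, c("wifi", "internet"))
```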
For the variable host_verifications the same applies as to amenities, and the same cleaning method is used: first we cleaned and split the strings, then a vector with all unique verification methods was generated.
verifications_unique
[1] "email" "phone" "reviews" "jumio" "offline_government_id" "selfie" "government_id" "identity_manual"
[9] "facebook" "work_email" "none" "google" "manual_offline" "manual_online" "sent_id" "kba"
[17] "weibo" "zhima_selfie" "sesame" "sesame_offline"
Separate variables are created for the most common methods of verification, namely email, phone, facebook and government_id.
# Create variable for host email and add to data set
host_email <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
if("email" %in% data_sub$host_verifications_clean[[i]]) {
host_email[i] <- "yes"
} else {
host_email[i] <- "no"
}
}
data_sub$host_email <- host_email
data_sub$host_email <- factor(data_sub$host_email, levels = c("yes", "no"))
# Create variable for phone and add to data set
host_phone <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
if("phone" %in% data_sub$host_verifications_clean[[i]]) {
host_phone[i] <- "yes"
} else {
host_phone[i] <- "no"
}
}
data_sub$host_phone <- host_phone
data_sub$host_phone <- factor(data_sub$host_phone, levels = c("yes", "no"))
# Create variable for host facebook and add to data set
host_facebook <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
if("facebook" %in% data_sub$host_verifications_clean[[i]]) {
host_facebook[i] <- "yes"
} else {
host_facebook[i] <- "no"
}
}
data_sub$host_facebook <- host_facebook
data_sub$host_facebook <-
factor(data_sub$host_facebook, levels = c("yes", "no"))
# Create variable for government id
host_government_id <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
if("government_id" %in% data_sub$host_verifications_clean[[i]]) {
host_government_id[i] <- "yes"
} else {
host_government_id[i] <- "no"
}
}
data_sub$host_government_id <- host_government_id
data_sub$host_government_id <-
factor(data_sub$host_government_id, levels = c("yes", "no"))
We used the variable host_since to create a new variable host_years_active, which contains information on the number of years a host has been active on the platform.
# Create new variable for active years host
data_sub <- data_sub %>%
mutate(host_years_active =
as.double(as.Date("2019-12-07") - host_since) / 365)
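A quick sanity check of the conversion above (the date difference is a difftime in days, which is divided by 365), using a hypothetical start date:

```r
# A host active since 2017-12-07 counts as active for two years on 2019-12-07
# (730 days / 365)
as.double(as.Date("2019-12-07") - as.Date("2017-12-07")) / 365
# → 2
```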
The variables availability_30, availability_60, availability_90 and availability_365 carry some of the same information. In order to inspect if all variables should be included in the data set, we plotted a correlation matrix. The plot below shows that all variables indicating the availability are strongly correlated. Therefore, only the variable availability_30 is included for further analysis.
# Plot correlation matrix
data_sub %>% select_if(is.numeric) %>%
select(availability_30, availability_60, availability_90, availability_365) %>%
cor() %>% corrplot()
# Remove other availability variables from data set
data_sub <-
data_sub %>%
select(-availability_60, -availability_90, -availability_365)
In order to inspect which variables have missing cases, and how many, a table is constructed. The table shows the variables in the data set that contain missing values (in descending order). The variable square_feet has \(19662\) missing cases, which is about \(98.19\%\). If we deleted the missing cases, the data set would barely contain any data, and other methods for handling missing values, such as replacing NA-values with the mean or median, would not be appropriate either, since they would be based on only \(1.81\%\) of the data. Therefore, square_feet is not included in the final data set. The variable host_response_rate has \(9349\) missing cases (about \(46.69\%\)), host_neighbourhood has \(5972\) (about \(29.82\%\)) and cleaning_fee has \(3604\) (about \(18.00\%\)). These three variables do not have as many missing values as square_feet, but the same reasoning applies, so they are also excluded from the analyses.
# Count missing cases per variable
na_counter <- sapply(data_sub, function(x) sum(is.na(x)))
vars <- colnames(data_sub)
# Extract all variables with NA-values
na_values <- tibble(variables = vars, na_count = na_counter)
# Check na count per variable
na_values %>%
filter(na_count > 0) %>%
arrange(desc(na_count))
Moreover, we checked how many cases would contain missing values if the remaining variables with missing cases were kept in the final data set. The table below shows that there are 206 such cases, which is about \(1.03\%\) of the entire data set. Since this is a very small proportion, deleting these cases is unlikely to have a large impact on the predictions. Furthermore, the models that will be fitted cannot handle missing values. Therefore, the cases with missing values are deleted.
# Compute incomplete rows
data_sub %>%
select(host_since, host_response_time, host_listings_count,
host_identity_verified, beds, bedrooms, host_years_active,
bathrooms) %>%
complete.cases() %>%
summary()
Mode FALSE TRUE
logical 206 19819
# Select variables for data set
variables_analysis <-
na_values %>%
filter(na_count <= 158) %>%
select(variables) %>%
pull(variables)
# Create final data set
data_semi_final <- data_sub %>% select(all_of(variables_analysis))
data_semi_final <- data_semi_final %>%
select(-c(amenities, amenities_clean, host_verifications,
host_verifications_clean, host_since))
data_final <- data_semi_final[complete.cases(data_semi_final), ]
For further analysis the data is split into a training and a test set.
# Create train and test sets
seed_x <- 123
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)
In order to prevent data leakage we only inspect the outcome variable price in the training set. To inspect price we created a distribution plot. The plot shows that the data set contains some outliers and that the distribution is right-skewed. Both the outliers and the skewness make the data less interpretable and could affect the performance of the models.
# Plot distribution price
ggplot(data = data_train , aes(price)) +
geom_histogram(col="black",
breaks=seq(0, max(data_train$price), by=75),
aes(fill=..count..)) +
labs(title="Distribution for Price", x="Price", y="Count") +
scale_fill_gradient("Count", low="green", high="red") +
theme(plot.title = element_text(hjust = 0.5))
To prevent these potential problems we created a new distribution plot with a log-transformed price. The plot shows that the distribution is less skewed and does not contain any large outliers, resulting in data that is more interpretable. Therefore, we will use the log-transformed price for our models.
# Plot distribution price
ggplot(data = data_train , aes(log(price + 1))) +
geom_histogram(col="black",
aes(fill=..count..)) +
labs(title="Distribution for Ln Price", x="Ln Price", y="Count") +
scale_fill_gradient("Count", low="green", high="red") +
theme(plot.title = element_text(hjust = 0.5))
# Log transform the price for both training and test set
data_final$price <- log(data_final$price + 1)
# Resplit the data using the same seed
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)
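Because the models are trained on log(price + 1), any prediction can be mapped back to the original euro scale with expm1(), the exact inverse of the transformation used above. A minimal sketch with a hypothetical price of 150:

```r
# Forward transform and exact back-transform of a hypothetical price
log_price <- log(150 + 1)   # transform a price of 150
expm1(log_price)            # back-transform to the euro scale
# → 150
```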
Moreover, we have generated 10-fold cross validation sets.
# Generate 10-fold CV sets
set.seed(321)
data_folds <- vfold_cv(data_train, v = 10)
data_folds
# 10-fold cross-validation
rm(data)
rm(data_sub)
rm(data_semi_final)
rm(na_values)
In this section, a regularized regression model is specified and trained. The lasso penalty is chosen so that the model simultaneously performs subset selection; therefore, mixture is set to 1 in the model specification.
Specification of the lasso-regularized linear regression model, where the penalty parameter will be tuned:
#specify the model and engine used
lasso_linreg <- linear_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
#check that model specified correctly:
lasso_linreg %>% translate()
In this section, the recipe is formulated. All variables in the final data set are included in the recipe, so that subset selection is performed through the lasso penalty. As property type and bed type have some categories with just a few observations, the categories that contain less than 1% of the total number of observations are combined into an "other" category to avoid sparse data. Additionally, dummies are created for all nominal variables. Lastly, all variables are normalized.
#prepare the recipe by setting up the regression model, setting id as id variable, combining small categories to other class, and creating dummies and normalizing variables
lasso_recipe <- recipe(price ~ .,
data = data_train) %>%
update_role(id, new_role = "ID") %>%
step_other(property_type, bed_type, threshold = 0.01, other = "other values") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_predictors(), -all_outcomes())
lasso_recipe
Testing that this works properly:
#prepare and bake the data (on training set) to check that the recipe prepares the data correctly
data_baked <- lasso_recipe %>% prep(data_train) %>% bake(data_train)
head(data_baked)
#combine the model specification and recipe into a workflow
lasso_wf <- workflow() %>%
add_recipe(lasso_recipe) %>%
add_model(lasso_linreg)
lasso_wf
Next, the \(\lambda\) parameter of the lasso model will be tuned. For that purpose, a tuning grid is specified.
#set tuning grid
grid_lasso <- tibble(penalty = 10^(seq(from = -5, to = 1, length.out = 70)))
10-fold cross-validation is used to tune the lasso-penalized linear regression, and the metrics are plotted against different values of \(\lambda\).
# perform grid search over the tuning grid of penalty values
lasso_tune <- lasso_wf %>%
tune_grid(resamples = data_folds,
grid = grid_lasso,
metrics = metric_set(mae, rmse, rsq_trad))
#save metrics in an object
lasso_tune_metrics <- lasso_tune %>%
collect_metrics()
# Plot all results metrics
lasso_tune_metrics %>%
ggplot(aes(x = penalty, y = mean,
ymin = mean - std_err, ymax = mean + std_err)) +
geom_linerange(alpha = 0.5, colour = "red") +
geom_point(colour = "red") +
facet_wrap(~ .metric, scale = "free_y") +
scale_x_log10() +
labs(y = "Lasso Performance Metrics", x = expression(lambda))
# Plot MAE
lasso_tune_metrics %>% filter(.metric == "mae") %>%
ggplot(aes(x = penalty, y = mean,
ymin = mean - std_err, ymax = mean + std_err)) +
geom_linerange(alpha = 0.5, colour = "red") +
geom_point(colour = "red") +
scale_x_log10() +
labs(y = "mae", x = expression(lambda),
title = "Lasso Regression MAE")
Next, the \(\lambda\) value that results in the best model performance on the training set is selected. Because RMSE is more sensitive to large residuals, its standard errors are larger than those of the mean absolute error (MAE). Therefore, the mean absolute error is used to select the best model.
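This sensitivity is easy to see numerically; a small illustration with hypothetical residuals:

```r
# One large residual inflates RMSE far more than MAE (hypothetical values)
residuals <- c(0.1, 0.1, 0.1, 2.0)
sqrt(mean(residuals^2))  # RMSE ≈ 1.004, dominated by the single outlier
mean(abs(residuals))     # MAE = 0.575, much less affected
```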
#show best models with corresponding penalty values
lasso_tune %>% show_best("mae")
The best model is selected using the one-standard-error rule: to avoid overfitting, the simplest model whose MAE is within one standard error of the best model is chosen.
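A toy illustration of the rule with entirely hypothetical CV results (this is what select_by_one_std_err does under the hood when sorting by descending penalty):

```r
# Take the largest penalty (simplest lasso model) whose MAE stays within one
# standard error of the minimum (all numbers hypothetical)
mae <- c(0.270, 0.271, 0.272, 0.281)  # hypothetical CV mean MAEs
pen <- c(0.001, 0.004, 0.007, 0.020)  # corresponding penalties
se  <- 0.003                          # hypothetical standard error
max(pen[mae <= min(mae) + se])
# → 0.007
```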
#select best model according to 1 std error rule
lasso_1se_model <- select_by_one_std_err(lasso_tune, metric = "mae", desc(penalty))
lasso_1se_model
As can be seen, the best model has a penalty parameter of 0.007.
Finalize the workflow:
#finalize lasso wf with the selected best model
lasso_wf_tuned <-
lasso_wf %>%
finalize_workflow(lasso_1se_model)
lasso_wf_tuned
#train the tuned model on all of the train data and test on the test data
lasso_last_fit <- lasso_wf_tuned %>%
last_fit(data_split, metrics = metric_set(mae, rmse, rsq_trad))
The performance on the test set for this model is:
#collect metrics from the model on the test set
lasso_test_metrics <- lasso_last_fit %>% collect_metrics()
lasso_test_metrics
As seen above, the final lasso model has a mean absolute error of 0.27, a root mean squared error of 0.37 and an R-squared of 48.7% on the test data.
To assess the importance of the predictor variables, model parameter estimates are calculated below:
#fit the model on the training data and pull the model coefficients for the variables
lasso_wf_tuned %>% fit(data_train) %>% pull_workflow_fit() %>% tidy()
As lasso performs subset selection automatically, some variables have a coefficient of zero, which implies that these variables are less important for predicting the price of a new AirBnB listing. The most important variables can be identified from the coefficients as well: the four most important are accommodates, availability_30 (the number of available days within the next 30 days), the room type entire home/apartment and, lastly, the Centrum-West neighbourhood.
Within this section, a random forest is created. First, a preprocessing recipe is created in which the id variable is given a separate role instead of being a predictor.
# Specify recipe
rf_recipe <- recipe(price ~ ., data = data_train) %>%
update_role(id, new_role = "id var")
rf_recipe
Within this section the tuning specification is given. The mtry parameter is the number of features considered at each split; its exact value will be tuned later on. Different values for trees were tested (200, 500 & 1000). Increasing the number of trees did not have much impact on the results, so a forest of 200 trees is chosen to save computation time.
# Tune specification
rf_tune_spec <- rand_forest(mtry = tune(), trees = 200) %>%
set_engine("ranger") %>%
set_mode("regression")
Combine the recipe and the model into a workflow that can be tuned.
# Workflow creation
rf_tune_wf <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_tune_spec)
A metric set that calculates the Root Mean Square Error (rmse), the Mean Absolute Error (mae) and the R-squared (rsq_trad) is created.
# Class metrics specification
class_metrics <- metric_set(rmse, mae, rsq_trad)
The command below allows us to do computations in parallel.
registerDoParallel()
The tuning grid was initially not optimized; instead, the command grid = tibble(mtry = 1:33) was used, which considers every possible number of predictors. Based on the MAE criterion, an mtry of 5, 6, 7, 8 or 9 was found to be optimal. Afterwards, a grid of mtry = c(1:10) is taken, which includes the optimal values as well as mtry values of 1 up to 4. This allows us to see performance improve as mtry increases, until it reaches its optimal value.
# Define the tune grid
rf_tunegrid <- tibble(mtry = c(1:10))
# Tune the grid
set.seed(12345)
rf_tune_res <- tune_grid(
rf_tune_wf,
resamples = data_folds,
grid = rf_tunegrid,
metrics = class_metrics
)
rf_tune_res
# Tuning results
# 10-fold cross-validation
# Collect metrics
rf_tune_res %>%
collect_metrics()
Below, a plot is shown for finding the best mtry based on the MAE criterion. A lower MAE indicates a better result, as it means a lower prediction error.
# Plot results all metrics
rf_tune_res %>%
collect_metrics() %>%
filter(.metric %in% c("rmse", "mae", "rsq_trad")) %>%
ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err,
colour = .metric)) +
geom_errorbar() +
geom_line() +
geom_point() +
facet_grid(.metric ~ ., scales = "free_y") +
labs(title = "Random Forest performance metrics")
# Plot the MAE based
rf_tune_res %>%
collect_metrics() %>%
filter(.metric == "mae") %>%
ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err)) +
geom_errorbar(colour = "red") +
geom_line(colour = "red") +
geom_point(colour = "red") +
labs(y = "mae", title = "Random Forest performance metrics")
This command shows the best mtry based on the MAE criterion.
# Find the mtry with the best mae
rf_tune_res %>% show_best("mae")
The best model based on the MAE criterion is selected and finalized into the workflow.
# Best model selection
best_mae <- select_best(rf_tune_res, "mae")
final_rf <- finalize_workflow(rf_tune_wf, best_mae)
final_rf
== Workflow ====================================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor ----------------------------------------------------------------
0 Recipe Steps
-- Model -----------------------------------------------------------------------
Random Forest Model Specification (regression)
Main Arguments:
mtry = 8
trees = 200
Computational engine: ranger
Now we can train the finalized workflow on the entire training set.
# Finalise workflow on training set
final_res <- final_rf %>%
last_fit(data_split, metrics = class_metrics)
The results on the test set are:
# Score on test data
set.seed(54321)
final_res %>%
collect_metrics()
Now we assess the variable importance. We refit the model with the previously tuned parameters; since we found an optimal mtry of 7, the mtry is specified as 7. However, keep in mind that because of the random element within a random forest this value may vary; we noticed that the optimal mtry switches between 6, 7 & 8.
# Refit the model
rf_model_vi <- rand_forest(mtry = 7, trees = 200) %>%
set_engine("ranger", importance = "permutation")
rf_vi_wf <- workflow() %>%
add_model(rf_model_vi) %>%
add_recipe(rf_recipe)
# Fit the model again
set.seed(12345)
rf_vi_fit <- rf_vi_wf %>% fit(data = data_train)
We can use the refitted model to gather the variable importance.
# Variable importance
rf_vi_fit %>% pull_workflow_fit() %>% vi()
The variable importance indicates that accommodates, bedrooms and room_type are the most important variables for predicting the log price. The least important variables are bed_type, pool and wifi. Pool and wifi actually have a negative importance, but as this is close to zero, we chose to still include these variables.
# Plot variable importance
var_importance_plot <-
rf_vi_fit %>%
pull_workflow_fit() %>% vip(geom = "point", num_features = 12) +
labs(title = "Random Forest Variable Importance") +
theme(plot.title = element_text(hjust = 0.5))
rf_vi_fit
== Workflow [trained] ==========================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor ----------------------------------------------------------------
0 Recipe Steps
-- Model -----------------------------------------------------------------------
Ranger result
Call:
ranger::ranger(formula = ..y ~ ., data = data, mtry = ~7, num.trees = ~200, importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 200
Sample size: 13874
Number of independent variables: 33
Mtry: 7
Target node size: 5
Variable importance mode: permutation
Splitrule: variance
OOB prediction error (MSE): 0.1278328
R squared (OOB): 0.5472826
# Save plot for presentation
ggsave("plots/rf_var_importance.png", plot = var_importance_plot,
height = 7 , width = 10)
# Generate tuning grid for knn: odd values of k from 1 to 99,
# which avoids ties when averaging over neighbors
knn_tune_grid <- tibble(neighbors = 1:50 * 2 - 1)
knn_tune_grid
knn_tune_grid
Something that should be noted for this recipe is that only numeric variables are included. The reason is that categorical variables translate poorly to a k-nearest neighbors algorithm. Prediction with a KNN model relies exclusively on the distance between points in the data. This distance is well defined for numeric variables, but for non-numeric variables a distance between data points cannot easily be defined, if it should be defined at all. (This also has implications for variable importance and coefficients, which are addressed at the end of the section on the KNN model.)
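To make the distance intuition concrete, the following base-R toy (all listing values are made up for illustration, not taken from the data set) computes the Euclidean distance KNN relies on:

```r
# Toy example: Euclidean distance between two listings based on
# numeric features only (values are made up for illustration)
a <- c(accommodates = 2, bedrooms = 1, beds = 1)
b <- c(accommodates = 4, bedrooms = 2, beds = 2)
dist_ab <- sqrt(sum((a - b)^2))
dist_ab  # sqrt(6), roughly 2.45
# A categorical value such as room_type = "Private room" has no natural
# position on this numeric scale, hence the step_rm(all_nominal()) below
```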
# Specify model
knn_mod <-
nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale=FALSE)
knn_mod
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = tune()
Engine-Specific Arguments:
scale = FALSE
Computational engine: kknn
# Specify recipe
knn_recipe <-
recipe(price ~ ., data = data_train) %>%
step_rm(all_nominal()) %>%
update_role(id, new_role = "id var") %>%
step_normalize(all_predictors(), -id)
knn_recipe
Data Recipe
Inputs:
Operations:
Delete terms all_nominal()
Centering and scaling for all_predictors(), -id
The normalization of the data is verified with the following commands:
# Check normalization
train_baked <- knn_recipe %>% prep(data_train) %>% bake(data_train)
train_baked %>% head()
round(colMeans(train_baked), 8)
id accommodates bathrooms bedrooms
19106945 0 0 0
beds host_listings_count guests_included extra_people
0 0 0 0
minimum_nights maximum_nights availability_30 calculated_host_listings_count
0 0 0 0
calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms host_years_active
0 0 0 0
price
5
round(apply(train_baked, 2, sd), 8)
id accommodates bathrooms bedrooms
1.141721e+07 1.000000e+00 1.000000e+00 1.000000e+00
beds host_listings_count guests_included extra_people
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
minimum_nights maximum_nights availability_30 calculated_host_listings_count
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms host_years_active
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
price
5.313830e-01
rm(train_baked)
Below, the initial workflow for the k-nearest neighbors model is specified:
# Specify workflow
knn_workflow <-
workflow() %>%
add_model(knn_mod) %>%
add_recipe(knn_recipe)
knn_workflow
== Workflow ====================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor ----------------------------------------------------------------
2 Recipe Steps
* step_rm()
* step_normalize()
-- Model -----------------------------------------------------------------------
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = tune()
Engine-Specific Arguments:
scale = FALSE
Computational engine: kknn
The code below serves to specify the assessment metrics that are used. Moreover, a grid search is performed using the validation sets.
The plot output shows the mean of the performance metrics across the grid. We aim for the mae (mean absolute error) and rmse (root mean squared error) to be as low as possible, and the rsq_trad (R-squared) to be as high as possible. We used the mae metric to determine the optimal number of neighbors for our model, which arrived at 51. This can be read from the mae graph by looking at the corresponding number of neighbors for the lowest mean mae.
Moreover, the last plot shows the elbow trend fairly clearly: the metrics reach their optimum, after which they level off and slowly increase for mae and rmse and decrease for rsq_trad.
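The selection logic itself is simple: take the k with the lowest mean mae. A stylized base-R sketch of that step (the mae curve below is invented to mimic the elbow shape, not taken from the actual tuning results):

```r
# Stylized sketch of the "lowest mean mae" selection that select_best()
# performs; the mae values are made up to mimic an elbow-shaped curve
k_grid <- seq(1, 99, by = 2)
mae <- 0.30 + 0.5 / k_grid + 0.0004 * k_grid  # high for tiny k, slow rise after
best_k <- k_grid[which.min(mae)]
best_k
```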
The model with the optimal number of k-nearest neighbors can then be selected as follows:
# Generate best model
knn_best_model <- select_best(knn_tune_res, metric = "mae")
Below, the finalized workflow is made, which automatically picks the best KNN model defined above (as selected by the mae metric):
# Finalize workflow
knn_workflow_final <-
knn_workflow %>%
finalize_workflow(knn_best_model)
knn_workflow_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor ----------------------------------------------------------------
2 Recipe Steps
* step_rm()
* step_normalize()
-- Model -----------------------------------------------------------------------
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = 51
Engine-Specific Arguments:
scale = FALSE
Computational engine: kknn
A final workflow can be set up to check the final fit. Furthermore, the performance metrics for the best KNN-model are selected and put in a table.
# Train and test the data set
knn_last_fit <-
knn_workflow_final %>%
last_fit(data_split,
metrics = metrics_reg)
KNN, as a method, does not come with a prediction for the importance or coefficients of variables. The reason for this has to do with the fact that prediction in a k-nearest neighbor model relies exclusively on the distance between data points. With this comes the added implication that no information about the relative importance of variables can be derived from it.
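The point can be illustrated with a minimal base-R sketch of a KNN prediction (toy listings and a single feature, all values invented): the prediction is just an average over the nearest neighbors, so there is no per-variable coefficient to report.

```r
# Toy sketch of a KNN regression prediction: the predicted price is the
# mean price of the k nearest listings, so no coefficients exist
toy <- data.frame(accommodates = c(2, 4, 2, 6, 3),
                  price        = c(90, 150, 100, 210, 120))
new_listing <- 3                          # accommodates value to predict for
k <- 3
d <- abs(toy$accommodates - new_listing)  # distance on the single feature
nearest <- order(d)[1:k]                  # indices of the k closest listings
pred <- mean(toy$price[nearest])
pred
```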
In order to assess the performance of the three models the results of three metrics are compared.
In order to check if the metrics are an appropriate fit, the residuals distributions of the models are plotted. Since the plots below show no large outliers and no skewed distributions, there are no implications for root mean squared error and mean absolute error.
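The reason outliers would matter is that rmse squares the residuals, so a single large residual inflates rmse far more than mae. A small illustration with made-up residual values:

```r
# Made-up residuals illustrating the outlier sensitivity of rmse vs mae
res_clean   <- c(-0.2, 0.1, 0.0, 0.2, -0.1)
res_outlier <- c(res_clean, 3)             # add one large residual
mae  <- function(r) mean(abs(r))
rmse <- function(r) sqrt(mean(r^2))
rmse(res_outlier) / rmse(res_clean)        # rmse inflates strongly
mae(res_outlier) / mae(res_clean)          # mae inflates less
```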
# Generate predicted values for price
lasso_test_preds <-
lasso_wf_tuned %>%
fit(data = data_train) %>%
predict(data_test) %>%
pull(.pred)
# Create tibble for distribution plot
lasso_pred <-
tibble(observed = data_test$price,
predicted = lasso_test_preds,
residual = observed - predicted)
# Plot distribution residuals
lasso_residual_plot <-
lasso_pred %>%
ggplot(aes(x = residual)) +
geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
geom_rug() +
labs(title = "Lasso Regularized Regression Distribution Residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_cartesian(ylim = c(0, 1.5))
lasso_residual_plot
# Generate predicted values for price
set.seed(12345)
rf_test_preds <-
rf_vi_wf %>%
fit(data = data_train) %>%
predict(data_test) %>%
pull(.pred)
# Create tibble for distribution plot
rf_pred <-
tibble(observed = data_test$price,
predicted = rf_test_preds,
residual = observed - predicted)
# Plot distribution residuals
rf_residual_plot <-
rf_pred %>%
ggplot(aes(x = residual)) +
geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
geom_rug() +
labs(title = "Random Forest Distribution Residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_cartesian(ylim = c(0, 1.5))
rf_residual_plot
# Generate predicted values for price
knn_test_preds <-
knn_workflow_final %>%
fit(data = data_train) %>%
predict(data_test) %>%
pull(.pred)
# Create tibble for distribution plot
knn_pred <-
tibble(observed = data_test$price,
predicted = knn_test_preds,
residual = observed - predicted)
# Plot distribution residuals
knn_residual_plot <-
knn_pred %>%
ggplot(aes(x = residual)) +
geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
geom_rug() +
labs(title = "KNN Distribution Residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_cartesian(ylim = c(0, 1.5))
knn_residual_plot
The table below shows the results for the metrics for the three models. When assessing the metrics and their objectives, the results show that the random forest model performs best on all metrics. Therefore, the random forest is considered to be the best model.
# Generate table with metrics for model comparison
lasso_metrics_compare <-
lasso_test_metrics %>%
select(-.estimator) %>%
mutate(model = "lasso regression")
rf_metrics_compare <-
final_res %>%
collect_metrics() %>%
mutate(model = "random forest")
knn_metrics_compare <-
knn_last_fit %>%
collect_metrics() %>%
select(-.estimator) %>%
mutate(model = "knn reg")
lasso_metrics_compare %>%
bind_rows(rf_metrics_compare, knn_metrics_compare) %>%
select(-.estimator) %>%
pivot_wider(names_from = .metric, values_from = .estimate)